ggplot2
ggplot2 is a general-purpose library for visualizing
data. The key principle of ggplot2 is the idea of building
up a plot by combining different layers, each responsible for a specific
function. You can learn more about the ggplot2 philosophy here, but
it’s probably best to start with an example. We will be using the
following classic dataset:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We wish to plot sepal length against sepal width.
ggplot(data=iris)+ # Notice that we use + and not the pipe %>%
  geom_point(mapping=aes(x=Sepal.Length,y=Sepal.Width))
We specify that the data we want to use is in the iris
object. Then we add our first layer geom_point and tell
ggplot2 that Sepal.Length should be mapped to
the x-axis and Sepal.Width should be mapped to the
y-axis.
Congratulations! This is your first graph. Suppose we want to change point shape based on the species.
ggplot(data=iris) + # Notice that we use + and not the pipe %>%
  geom_point(mapping=aes(x=Sepal.Length,y=Sepal.Width,
                         shape=Species))
We just need to map the Species variable to the shape
aesthetic.
If we felt that shape alone is not enough to help the viewer separate points by species, we could add color:
plot <- ggplot(data=iris,
mapping=aes(x=Sepal.Length,y=Sepal.Width) )+
geom_point(mapping=aes(shape=Species,color=Species))
plot
Suppose we want to focus on a specific portion of the graph:
coord_cartesian controls the coordinates and can zoom on
a part of the plot. Notice that we have stored our previous plot into
the plot object and we are now adding more layers to it
with the + operator.
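A minimal sketch of the zoom described above. The axis limits are illustrative assumptions, and the base plot is rebuilt so that the snippet runs on its own:

```r
library(ggplot2)

# Rebuild the scatter plot stored in `plot` earlier in this section
plot <- ggplot(data = iris,
               mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(mapping = aes(shape = Species, color = Species))

# Zoom on a region of the plot; these limits are arbitrary example values
zoomed <- plot + coord_cartesian(xlim = c(5, 6), ylim = c(2.5, 3.5))
zoomed
```

coord_cartesian only changes the visible window: points outside the limits are hidden from view but are not dropped from the data.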
We can change the axis labels and add a title:
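A minimal sketch using labs, assuming the same scatter plot as before; the label and title text are made up for illustration:

```r
library(ggplot2)

# Same base plot as above, rebuilt so the snippet is self-contained
plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(shape = Species, color = Species))

# labs() sets axis labels and the title in a single call
labelled <- plot +
  labs(x = "Sepal length (cm)",
       y = "Sepal width (cm)",
       title = "Sepal dimensions in the iris dataset")
labelled
```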
We could also add a regression line:
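One way to do this is with geom_smooth and a linear model. This is a sketch that assumes a single line fitted to all species; the original figure may have been built differently:

```r
library(ggplot2)

# x and y are mapped in ggplot(), so every layer inherits them
plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(shape = Species, color = Species))

# method = "lm" fits a straight line; se = TRUE draws the confidence band.
# Color is mapped only inside geom_point, so one line is fitted to all points.
regression <- plot + geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
regression
```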
Maybe it would be better to have each species in a separate graph:
facet_grid lets you specify which variables should be
used in the rows and which in the columns.
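A minimal sketch of faceting by species, assuming we want one panel per species laid out in columns:

```r
library(ggplot2)

plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(shape = Species, color = Species))

# One column per species; use `rows = vars(Species)` to stack panels instead
faceted <- plot + facet_grid(cols = vars(Species))
faceted
```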
We are not confined to points. Here is a histogram of sepal length.
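A minimal sketch of such a histogram; the number of bins is an arbitrary choice for illustration:

```r
library(ggplot2)

# A histogram only needs an x aesthetic; counts are computed by ggplot2
hist_plot <- ggplot(iris) +
  geom_histogram(aes(x = Sepal.Length), bins = 20)
hist_plot
```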
Here is a fancier violin plot by species.
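A sketch of a violin plot by species, assuming fill is mapped to Species inside aes while the outline color is set as a constant outside of it:

```r
library(ggplot2)

violin <- ggplot(iris) +
  geom_violin(aes(x = Species, y = Sepal.Length, fill = Species), # mapped to data
              color = "black")                                    # fixed value
violin
```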
This graph is also useful to show the difference between specifying
color and fill within a call to
aes and outside of the call.
When we specify color or fill within
aes, ggplot expects a variable and tries to
map this variable to the desired aesthetic. Here is an example:
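A sketch of what this looks like, assuming the same violin plot with the strings 'white' and 'black' placed inside aes:

```r
library(ggplot2)

# 'white' and 'black' are inside aes(), so they are treated as data values,
# not as literal colors
mapped <- ggplot(iris) +
  geom_violin(aes(x = Species, y = Sepal.Length,
                  fill = 'white', color = 'black'))
mapped
```

The violins are drawn with colors from the default palette rather than in white and black, and a legend with the entries 'white' and 'black' appears.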
Here ggplot tried converting ‘white’ and ‘black’ into
variables and, under the hood, created two temporary variables with the
same number of values as there are rows in iris assigning
the value ‘white’ to one and ‘black’ to the other. This behaviour can be
exploited to avoid having to reshape our data.
Here is an example of how we can trick ggplot into
letting you use data in the “wrong” shape.
iris %>%
ggplot() +
  geom_violin(mapping=aes(x='Sepal Length',y=Sepal.Length,fill='Sepal Length')) +
  geom_violin(mapping=aes(x='Sepal Width',y=Sepal.Width,fill='Sepal Width')) +
  labs(x='',y='Centimeters',fill='')
Ideally, ggplot would want us to reshape the data into
long format where Sepal Length and Sepal Width are values of a new
variable type and the corresponding values are stored into
a new variable called value.
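For comparison, here is a sketch of the reshaping route, assuming the tidyr package is available; the column names type and value follow the description above:

```r
library(ggplot2)
library(tidyr)

# Stack the two sepal columns into long format: one row per measurement
iris_long <- pivot_longer(iris,
                          cols = c(Sepal.Length, Sepal.Width),
                          names_to = "type",
                          values_to = "value")

# With long data, a single geom_violin call is enough
long_plot <- ggplot(iris_long) +
  geom_violin(aes(x = type, y = value, fill = type)) +
  labs(x = '', y = 'Centimeters', fill = '')
long_plot
```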
Beyond what we have seen so far, there are many different geometries you can use:
geom_line
geom_boxplot
geom_density
geom_sf
geom_errorbar
geom_ribbon
and many more.
You can customize almost every aspect of a plot: axes, labels, grid
lines, legend, orientation, size, palettes. However,
ggplot2 also offers a set of themes that modify several
aspects of a figure at once:
and more are available through the ggthemes package.
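A minimal sketch of applying a theme, using theme_minimal as an arbitrary pick among the themes that ship with ggplot2 (others include theme_bw and theme_classic):

```r
library(ggplot2)

plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(shape = Species, color = Species))

# A theme is added like any other layer and restyles the whole figure
themed <- plot + theme_minimal()
themed
```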
Exercise
Using the iris dataset, create a new graph with a
different box plot of Sepal.Width for each Species. Label
the axes in an appropriate way. Use the theme you prefer.
Solution
ggplot(data=iris) +
geom_boxplot(mapping=aes(x=Species,y=Sepal.Width)) +
labs(y='Sepal Width') +
  theme_bw()
ggplot
Many datasets of interest come with some geographical information
attached. It would be nice to show this dimension explicitly in our
plots and luckily ggplot makes it straightforward.
First, we use the USAboundaries package to download a
boundary file for US states.
states <- USAboundaries::us_states()
states <- tigris::shift_geometry(states)
# This is a useful function to move Alaska, Hawaii, and Puerto Rico
# to a position that makes the map more compact
This is a dataframe augmented with geographic information stored in the geometry column. The content of this column is what allows us to plot the data contained in this dataframe as a map.
We are now ready to build the map.
states %>%
ggplot() +
geom_sf(mapping=aes(fill=log(aland))) +
scale_fill_viridis_c() +
labs(fill='Land Area (log)') +
  theme_map()
The USAboundaries package has similar boundary files for
counties, zip codes, and congressional districts and also offers the
possibility to use historical boundaries. Notice that we have used the
theme_map from the ggthemes package to make
the final plot nicer.
As with the other types of plots, we can set the border color and width, or even eliminate borders entirely. Faceting also works in the same way as we saw for other plots and allows us to combine different maps into a single figure.
Sometimes, it might seem that ggplot lacks the type of
geometry we want. Here is an example. Suppose we want to build a bar
chart in which values larger than 1 are represented by bars that rise
above the horizontal line y=1 and values below 1 are
represented by bars that extend downward from that line (this
makes sense if the values are ratios). We would quickly realize that
geom_bar or geom_col do not allow this
behavior.
These are the situations in which we need to be creative. We can start by
thinking that a bar is not very different from a segment with a large
width. Once we’ve made this connection, it’s trivial to realize that
what we want are segments that start at y=1 and end at some
value yend, whether this value is larger or smaller than 1.
With the appropriate width the result will be indistinguishable from a
bar chart.
Here is what such a graph would look like:
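A sketch of the segment trick with made-up ratios (the data frame below is purely hypothetical, not the data behind the original figure):

```r
library(ggplot2)

# Hypothetical ratios, invented for illustration only
ratios <- data.frame(
  group = c("A", "B", "C", "D"),
  ratio = c(2, 0.5, 1.5, 0.8)
)

ratio_plot <- ggplot(ratios) +
  # Each "bar" is a thick segment running from y = 1 to the observed ratio
  geom_segment(aes(x = group, xend = group, y = 1, yend = ratio),
               linewidth = 15) +  # assumes ggplot2 >= 3.4; older versions use `size`
  geom_hline(yintercept = 1) +    # reference line at a ratio of 1
  labs(x = "", y = "Ratio")
ratio_plot
```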
The graph we just saw also illustrates another point. Sometimes, we
want to transform one or more of the scales. Ratios, for example, are
best plotted on a log2 scale in which 2 (twice) and 0.5 (half) are
symmetric around 1. ggplot makes it easy. We just need to
add a line to the code for our plot:
plotCode + coord_trans(y='log2').
The scales package implements many common transformations
and also allows you to build your own. Here is one more example in which
the log10 transformation has been applied to the y axis to make the
visualization more compact.
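A minimal sketch of a log10-transformed y axis, assuming the diamonds dataset that ships with ggplot2 as a stand-in for the original data:

```r
library(ggplot2)

# Prices span several orders of magnitude, so a log10 axis compresses the range
log_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_y_log10()
log_plot
```

Unlike a coordinate transformation, a scale transformation is applied to the data before any statistics are computed.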
Often, integrating text annotations into your plots can help make them clearer. Here is a simple example:
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
  annotate("text", x = 4, y = 25, label = "Some text")
This can also be used to highlight some areas:
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
annotate("text", x = 4, y = 25, label = "Some text") +
annotate("rect", xmin = 3, xmax = 4.2, ymin = 12, ymax = 21,
           alpha = .2)
The annotate function is very flexible when it comes to
the type of annotation but is not well suited when elements of the
annotation depend on the values of some data points. In these cases, we
can use geom_text, geom_label, or
geom_rect. Here is an example:
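A minimal sketch with geom_text, assuming we label each car in mtcars with its row name:

```r
library(ggplot2)

# The label aesthetic is taken from the data, one label per point
text_plot <- ggplot(mtcars, aes(x = wt, y = mpg, label = rownames(mtcars))) +
  geom_point() +
  geom_text()
text_plot
```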
ggplot has a built-in function (ggsave) to
save plots to different graphic formats: “eps”, “ps”, “pdf”, “jpeg”,
“tiff”, “png”, “bmp”, “svg”, and others. You can set the size of the
resulting image. By default, ggsave saves the last plot you
displayed. Here is an example:
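A minimal sketch of ggsave; the file name, location, and size are illustrative assumptions:

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# Hypothetical output path; the format is inferred from the file extension
out_file <- file.path(tempdir(), "iris_scatter.pdf")

# `plot` is given explicitly here; omit it to save the last plot displayed
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
```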
If this is not flexible enough, you can use R’s built-in graphic devices (less intuitive). Here is an example:
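A sketch of the device-based route, using the built-in pdf device as an example; the file name is a hypothetical choice:

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

device_file <- file.path(tempdir(), "iris_scatter_device.pdf")

pdf(device_file, width = 6, height = 4)  # 1. open the device
print(p)                                 # 2. draw the plot on it
dev.off()                                # 3. close the device to write the file
```

The explicit print call matters: inside scripts, loops, and functions a ggplot object is not drawn automatically.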
There are devices for most commonly used graphic types.
ggplot with other Packages
The list of functionalities provided by ggplot is
impressive but sooner or later you will need to do something that goes
beyond what this package can do. Luckily there is a high chance that
someone encountered the same issue before you and coded the solution as
a package. Here is a (non-exhaustive) list of such packages:
patchwork: combine multiple plots in a smart and easy-to-code way.
ggh4x: more faceting options, including nested facets! Very useful.
geofacet: provides geofaceting functionalities for ggplot.
ggridges: build amazing ridge plots.
ggrepel: smart labels.
colorspace: select palettes with desirable properties for your plots.
Here is an example of the basic usage of patchwork with
the mtcars dataset.
p1 <- ggplot(mtcars) + geom_point(aes(mpg, disp))
p2 <- ggplot(mtcars) + geom_boxplot(aes(gear, disp, group = gear))
p1 + p2
This is a more complicated example, which illustrates the flexibility of this package.
p3 <- ggplot(mtcars) + geom_smooth(aes(disp, qsec))
p4 <- ggplot(mtcars) + geom_bar(aes(carb))
(p1 | p2 | p3) /
  p4
Both examples are taken from the patchwork
documentation.
Here are two additional examples from my research.
We’ve already seen an example of nested facets using
ggh4x here:
Geofaceting can be a very nice way of presenting information in a compact way. Here is an example for the US.
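A sketch of geofaceting for US states, assuming the state_ranks example dataset bundled with geofacet rather than the data behind the original figure:

```r
library(ggplot2)
library(geofacet)

# state_ranks ships with geofacet: one row per state and ranking category
geo_plot <- ggplot(state_ranks, aes(x = variable, y = rank, fill = variable)) +
  geom_col() +
  coord_flip() +
  facet_geo(~ state) +  # panels are arranged to mimic the US map
  theme_bw()
geo_plot
```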
The geofacet package offers many different grids for
different countries and levels of aggregation. However, this type of
visualization can soon become confusing if the layout is too far from
the underlying geography.
As general advice, geofaceting works well when there are many units of a similar size and with relatively simple shapes. Also, the reader needs to be familiar with the map being represented.
The most famous example of a ridge plot is probably the cover of the Unknown Pleasures album by the British band Joy Division.
Ridge plots can be a way to make a visualization more compact. Here is a basic example from the package documentation using the iris dataset.
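A minimal sketch of such a ridge plot with geom_density_ridges, assuming the ggridges package is installed:

```r
library(ggplot2)
library(ggridges)

# One density ridge per species; densities are estimated from the data
ridge_plot <- ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges()
ridge_plot
```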
While this type of plot works best with densities (computed from the
data) it can also be used to plot lines (with
geom_ridgeline).
Here is an example:
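A sketch of geom_ridgeline with a small made-up data frame, where the height of each line is supplied directly instead of being estimated:

```r
library(ggplot2)
library(ggridges)

# Hypothetical data: three lines (y = 0, 1, 2), each observed at x = 1..5
d <- data.frame(
  x = rep(1:5, 3),
  y = c(rep(0, 5), rep(1, 5), rep(2, 5)),
  height = c(0, 1, 3, 4, 0, 1, 2, 3, 5, 4, 0, 5, 4, 4, 1)
)

# `height` gives the ridge height at each x; `group` keeps the lines separate
ridgeline_plot <- ggplot(d, aes(x = x, y = y, height = height, group = y)) +
  geom_ridgeline(fill = "lightblue")
ridgeline_plot
```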
The main issue is that this type of plot is harder to read with discrete data (but works well if the variable mapped to the x axis takes many different values).
Remember our plot to illustrate geom_text? It was not
very pretty because some of the labels were overlapping. Luckily the
ggrepel package has just the function we need in these
cases.
ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
geom_point() +
  geom_text_repel(min.segment.length = 0)
Notice that not all of the points have been labeled. We can set how much overlap is tolerable if we want all the labels to appear.
ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
geom_point() +
  geom_text_repel(min.segment.length = 0, max.overlaps = Inf)
Selecting the right palette for a plot is probably one of the hardest
parts of building a visualization. Luckily, the R ecosystem offers many
packages to help you in this journey. The one I like the most, because
of its completeness and ease of use, is colorspace.
colorspace offers a vast selection of qualitative,
sequential, and diverging palettes which are designed to be perceptually
uniform and colorblind safe.
colorspace offers nice functionalities to test out a
palette before using it:
## all built-in demos with the same sequential heat color palette
par(mfrow = c(2, 3))
cl <- sequential_hcl(5, "Heat")
for (i in c("map","heatmap","scatter","bar","perspective","lines")) {
demoplot(cl, type = i)
}
Another useful functionality is to emulate color vision deficiencies. Here is an example for a palette with poor properties:
par(mfrow = c(2, 2),mar=c(1,1,1,1))
demoplot(rainbow(11, end = 2/3)) # Original
demoplot(deutan(rainbow(11, end = 2/3))) # Deuteranomaly
demoplot(protan(rainbow(11, end = 2/3))) # Protanomaly
demoplot(tritan(rainbow(11, end = 2/3))) # Tritanomaly
And here is an example for a palette with better properties:
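A sketch of the same comparison, assuming the "Viridis" sequential palette from colorspace as the better-behaved choice (the original may have used a different palette):

```r
library(colorspace)

# Assumed replacement palette: 11 colors from the sequential "Viridis" scheme
good_pal <- sequential_hcl(11, "Viridis")

par(mfrow = c(2, 2), mar = c(1, 1, 1, 1))
demoplot(good_pal)          # Original
demoplot(deutan(good_pal))  # Deuteranomaly
demoplot(protan(good_pal))  # Protanomaly
demoplot(tritan(good_pal))  # Tritanomaly
```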
We encourage you to look up the documentation of
colorspace for more functionalities (including an amazing
interactive tool to create your own palettes). Other packages to deal
with colors include scico, RColorBrewer, and
paletteer.
You might have noticed how this presentation combines formatted text, R code and plots. This presentation uses RMarkdown, a system integrated in RStudio that lets you write code through a notebook interface and create reproducible documents.
If you clone the repository where I uploaded this presentation’s material you’ll be able to recreate this document with a simple click of the “Knit” button in RStudio.
You have many different output types you can choose from: html, markdown, doc, pdf (through LaTeX), and others.
You can learn more about RMarkdown functionalities here.
You are going to produce two graphics using data from this project. The goal of the project was to estimate monthly excess deaths during the COVID-19 pandemic for each county in the US.
Navigate to data/output/estimates. Then download the files
estimatesStateMonths.csv and
estimatesPYears.csv. The first file has estimates
aggregated by state and month, the second file has estimates aggregated
by county and “pandemic” year (1st: Mar 2020-Feb 2021, 2nd: Mar 2021-Feb
2022, 3rd: Mar 2022-Aug 2022).
You can be creative but ideally you would build a first visualization focusing on the geographical dimension using the county level data and a second one focusing on the temporal dimension using the state level data.
If you are stuck, we are happy to help but remember that ChatGPT is very good at coding! (here is an example).
When you are satisfied with your result, you can upload your work to this doc. Docs does not like PDF images, so you should save it as svg or png instead. Screenshots are also ok!